About the project

Right now I am feeling great because I have started something really new. I knew R existed and that we could do a lot of statistical analysis and data visualization with it, but working with R and GitHub together is amazing. From this course I expect to expand my R basics with some new techniques in data visualization. I heard about this course from a colleague.

Link to my GitHub repository: https://github.com/Hem-web/IODS-project


Chapter 2: Regression and model validation

Reading the data

setwd("~/IODS-project")
learning2014 <- read.table("~/IODS-project/data/learning2014.txt") 
str(learning2014) 
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(learning2014) 
## [1] 166   7

Comment on structure and dimension of the data

  • The data frame contains 166 observations of 7 variables, where age and points are integers and attitude, deep, stra and surf are numeric.
  • Gender is a factor with two levels, male (M) and female (F).

Exploring the data

pairs(learning2014[!names(learning2014) %in% c("gender")], col = learning2014$gender)

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)
ggpairs(learning2014, 
        mapping = aes(col = gender, alpha = 0.3), 
        lower = list(combo = wrap("facethist", bins = 20)))

Interpreting the data

  • Looking at the main diagonal (density curves) and the first column (histograms), age is by far the most skewed variable, whereas stra is only minimally skewed.
  • The high skewness of age is also visible in its box plots, which show many outliers above the whiskers. The minimal skewness of stra can likewise be seen in its box plot, where the medians for both genders lie close to the center of the box.
  • The highest correlation is between attitude and points (Cor: 0.4365245).
  • The lowest correlation is between deep and surf (Cor: -0.323802).
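As a sketch of how such correlation rankings can be pulled out programmatically (using the built-in mtcars data as a stand-in, since learning2014 requires the course data file; variable names here are illustrative only):

```r
# List variable pairs ordered by absolute correlation
# (mtcars stands in for learning2014 here)
num_vars <- mtcars[, c("mpg", "hp", "wt", "qsec")]
cm <- cor(num_vars)
cm[upper.tri(cm, diag = TRUE)] <- NA      # keep each pair only once
pairs_df <- na.omit(as.data.frame(as.table(cm)))
pairs_df[order(abs(pairs_df$Freq), decreasing = TRUE), ]
```

The strongest and weakest pairwise associations then sit at the top and bottom of the printed data frame.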

Linear Regression

qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")

Running the regression model with all five explanatory variables (attitude, surf, stra, age, and deep) to identify which of them has the strongest impact on points.

my_model <- lm(points ~ attitude + surf + deep + stra + age, data = learning2014)
results <- summary(my_model)

knitr::kable(results$coefficients, digits=3, caption="Regression coefficients")
Regression coefficients

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     18.301       5.241    3.492     0.001
attitude         3.432       0.570    6.022     0.000
surf            -1.105       0.840   -1.316     0.190
deep            -1.044       0.782   -1.335     0.184
stra             0.966       0.539    1.791     0.075
age             -0.097       0.053   -1.809     0.072

After running the model with five explanatory variables, we observed that the strongest effect on exam points came from attitude towards statistics. Nevertheless, let’s remove stra, age, and deep and run the model again to see whether keeping surf in the model changes anything.

my_model1 <- lm(points ~ attitude + surf, data = learning2014)
results <- summary(my_model1)

knitr::kable(results$coefficients, digits=3, caption="Regression coefficients")
Regression coefficients

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     14.120       3.127    4.515     0.000
attitude         3.426       0.576    5.944     0.000
surf            -0.779       0.796   -0.979     0.329

It seems that even after removing those three variables, the coefficient of the surface learning approach barely changed and remained statistically insignificant, suggesting that attitude is the only explanatory variable with a clear impact on points. So, let’s run the model again with attitude as the only explanatory variable.

my_model3 <- lm(points ~ attitude, data = learning2014)
results <- summary(my_model3)

knitr::kable(results$coefficients, digits=3, caption="Regression coefficients")
Regression coefficients

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     11.637       1.830    6.358         0
attitude         3.525       0.567    6.214         0

The final model summary suggests that both the intercept and attitude are statistically significant. The attitude estimate of 3.525 means that for every one-unit increase in attitude, the point score increases by 3.525 points, whereas the intercept of 11.637 means that a student with an attitude score of zero would still be predicted to get 11.637 points. The coefficient of determination \(R^2\) = 0.1905537 indicates that attitude explains only about 19% of the variation in points. This suggests that there must be other explanatory variables that influence the points.
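As a quick sanity check on this interpretation, R² can be reproduced by hand as the squared correlation between the response and the fitted values; a minimal sketch on the built-in mtcars data (since learning2014 needs the course file):

```r
# R^2 from summary() equals cor(response, fitted values)^2
fit <- lm(mpg ~ wt, data = mtcars)
rsq_summary <- summary(fit)$r.squared
rsq_manual  <- cor(mtcars$mpg, fitted(fit))^2
all.equal(rsq_summary, rsq_manual)  # TRUE
```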

Diagnostic plots

plot(my_model3, which=c(1,2,5))

  1. Interpretation of the diagnostic plots
  • The linear regression model rests on two main assumptions: first, the relationship between the target and the explanatory variables is linear, and second, the errors are normally distributed and uncorrelated, with constant variance, and are independent of the explanatory variables.
  • Based on the QQ plot, the errors are close to normally distributed, as the points fall close to the reference line.
  • The even spread in the residuals vs fitted plot suggests that the error variance is constant and does not depend on the explanatory variable.
  • The residuals vs leverage plot helps identify observations with an extreme influence on the model. Since no single point stands out in this plot, no observation has unusually high leverage in our model.
  • As the model assumptions appear to hold, we can conclude that the model is valid.
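The quantities behind these three plots can also be extracted directly, which is handy for checking individual observations; a sketch on mtcars (base-R diagnostics, not the course data):

```r
fit <- lm(mpg ~ wt, data = mtcars)
res <- residuals(fit)    # residuals vs fitted values
std <- rstandard(fit)    # standardized residuals, used in the QQ plot
lev <- hatvalues(fit)    # leverage, used in residuals vs leverage
# the hat values always sum to the number of model coefficients
sum(lev)  # 2
```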

Chapter 3: Logistic Regression

Reading the data

setwd("~/IODS-project")
alc <- read.table("~/IODS-project/data/pormath.txt") 
names(alc) 
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
str(alc)
## 'data.frame':    382 obs. of  35 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : int  2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : int  8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : int  8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...

Explanation of the data

  • The data have 382 observations and 35 variables. There are 17 factor variables, one numeric, one logical, and the rest are of integer type.

Development of hypothesis based on selected variables

  • There are many interesting variables in the data that could be strongly linked to alcohol consumption. However, I find the following variables quite interesting and propose a personal hypothesis for each of them with regard to alcohol consumption.

    • grades (G1): Having low grades already in the first period could encourage alcohol consumption, because low grades are generally perceived as failure.
    • freetime: More free time after school could discourage alcohol consumption, because it helps ease the stress that has likely accumulated during school hours.
    • studytime: More study hours demand more focus. This likely induces the feeling of needing an extra mode of refreshment, and could therefore encourage alcohol consumption.
    • absences: Absences may result from stress, lack of interest in studying, and so on. Absences from school mean that a student also spent time doing something other than studying. Therefore, more absences could be a sign of spending time drinking alcohol.

Exploring the distribution of selected variables and their relationships with alcohol consumption

  • To explore the distribution of the selected variables, let’s first load the following packages:
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# For the graphical exploration I will use box plots of the variables against high alcohol consumption. For that, let's run the following script
g1 <- ggplot(alc, aes(x = high_use, y = G1))
g1 + geom_boxplot() + ylab("G1")

g2 <- ggplot(alc, aes(x = high_use, y = freetime))
g2 + geom_boxplot() + ylab("freetime")

g3 <- ggplot(alc, aes(x = high_use, y = studytime))
g3 + geom_boxplot() + ylab("studytime")

g3 <- ggplot(alc, aes(x = high_use, y = absences))
g3 + geom_boxplot() + ylab("absences")

  • Students’ grades in the first period seem to have an effect (perhaps minor) on alcohol consumption, as the box plots for high (TRUE) and low (FALSE) alcohol use differ in length.
  • There seems to be no effect of free time after school on alcohol consumption, as the box plots for high (TRUE) and low (FALSE) use are about the same size.
  • Study time also seems to have an effect on alcohol consumption, as the box plots for high (TRUE) and low (FALSE) use differ in length.
  • Absences show higher variation with respect to alcohol consumption, as there are many outliers above the whiskers.
  • My personal hypotheses are pretty close to these results, except for the effect of free time.

Performing regression analyses on selected variables and alcohol consumption

m <- glm(high_use ~ G1 + freetime + studytime + absences, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ G1 + freetime + studytime + absences, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.885  -0.828  -0.634   1.110   2.129  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.49537    0.72601  -0.682 0.495036    
## G1          -0.08391    0.04591  -1.828 0.067614 .  
## freetime     0.33103    0.12468   2.655 0.007929 ** 
## studytime   -0.45144    0.15958  -2.829 0.004671 ** 
## absences     0.07957    0.02245   3.544 0.000395 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 423.08  on 377  degrees of freedom
## AIC: 433.08
## 
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept)          G1    freetime   studytime    absences 
## -0.49536973 -0.08390687  0.33102562 -0.45144113  0.07956876
OR <- coef(m) %>% exp
CI<-confint(m)
## Waiting for profiling to be done...
cbind(OR, CI)
##                    OR       2.5 %       97.5 %
## (Intercept) 0.6093456 -1.92821285  0.925191583
## G1          0.9195169 -0.17508225  0.005453013
## freetime    1.3923955  0.08987902  0.579750936
## studytime   0.6367099 -0.77323579 -0.146089296
## absences    1.0828200  0.03763152  0.126082337
  • Summary of the fitted model: the significance levels of freetime, studytime, and absences suggest that they have a significant impact on alcohol consumption, while G1 is only marginally significant (p = 0.068).
  • Coefficients of the model as odds ratios: freetime (OR 1.39) and absences (OR 1.08) are associated with higher odds of high alcohol use, while studytime (OR 0.64) is associated with lower odds. Note that the confidence intervals printed above are still on the log-odds scale; they should be exponentiated to be read on the odds-ratio scale.
  • Comparing to the previously stated hypotheses: the results support the hypotheses for grades (negative direction, though only marginally significant) and absences, but contradict them for freetime (more free time is associated with higher, not lower, use) and studytime (more study time is associated with lower, not higher, use).
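A minimal sketch of putting both the coefficients and their confidence intervals on the odds-ratio scale (a toy logistic model on mtcars, not the course model):

```r
# Exponentiate both the coefficients and the CI so they share the OR scale
m_toy <- glm(am ~ wt, data = mtcars, family = "binomial")
OR <- exp(coef(m_toy))
CI <- exp(confint(m_toy))  # exp() moves the CI from log-odds to odds ratios
cbind(OR, CI)
```

With both columns exponentiated, an interval entirely below 1 (or entirely above 1) indicates a significant association.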

Exploring the predictive power of the model

# fit the model
m <- glm(high_use ~ G1 + freetime + absences, data = alc, family = "binomial")

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to predict high_use (TRUE when probability > 0.5)
alc <- mutate(alc, prediction = probability > 0.5)
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = high_use, y = probability))

# define the geom as points and draw the plot
g + geom_point() 

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
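Since high_use is logical, it is coerced to 0/1 inside abs(class - prob); a toy call with hypothetical values shows what the loss function counts:

```r
# proportion of predictions on the wrong side of the 0.5 threshold
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# two of the four toy predictions fall on the wrong side of 0.5
loss_func(class = c(TRUE, TRUE, FALSE, FALSE), prob = c(0.9, 0.2, 0.1, 0.6))  # 0.5
```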

Chapter 4: Clustering and Classification

Reading the data

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
## [1] 506  14
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

  • Distribution of the variables: looking at both distribution figures, none of the variables is evenly distributed except rm (average number of rooms per dwelling).
  • Relationships between the variables: rad (accessibility to radial highways) and tax show a strong positive association, as seen in “Distribution2”. In the same figure, a similarly strong but negative association is seen between lstat and medv.
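These two associations can be checked numerically; a quick sketch on the Boston data (MASS ships with R):

```r
library(MASS)
data(Boston)
# rad–tax is strongly positive, lstat–medv strongly negative
round(c(rad_tax = cor(Boston$rad, Boston$tax),
        lstat_medv = cor(Boston$lstat, Boston$medv)), 2)
```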

Standardization of the data

  • When standardizing the data we scale the variables so that each has mean zero and standard deviation one.
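A minimal sketch of what scale() does to a single column (mtcars as a stand-in):

```r
x <- scale(mtcars$wt)   # subtract the column mean, divide by the sd
round(mean(x), 10)      # 0
round(sd(x), 10)        # 1
```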
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
## [1] "matrix"
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
## crime
##      low  med_low med_high     high 
##      127      126      126      127
## [1] 506  14
## 'data.frame':    506 obs. of  14 variables:
##  $ zn     : num  0.285 -0.487 -0.487 -0.487 -0.487 ...
##  $ indus  : num  -1.287 -0.593 -0.593 -1.306 -1.306 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.144 -0.74 -0.74 -0.834 -0.834 ...
##  $ rm     : num  0.413 0.194 1.281 1.015 1.227 ...
##  $ age    : num  -0.12 0.367 -0.266 -0.809 -0.511 ...
##  $ dis    : num  0.14 0.557 0.557 1.077 1.077 ...
##  $ rad    : num  -0.982 -0.867 -0.867 -0.752 -0.752 ...
##  $ tax    : num  -0.666 -0.986 -0.986 -1.105 -1.105 ...
##  $ ptratio: num  -1.458 -0.303 -0.303 0.113 0.113 ...
##  $ black  : num  0.441 0.441 0.396 0.416 0.441 ...
##  $ lstat  : num  -1.074 -0.492 -1.208 -1.36 -1.025 ...
##  $ medv   : num  0.16 -0.101 1.323 1.182 1.486 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 1 1 1 1 1 1 2 2 2 2 ...
  • After scaling the variables, the mean of each has now changed to zero.
  • During standardization of the data, we also created a categorical variable crime, using the quantiles of the scaled crime rate as break points, and saved the scaled data as a data frame.
  • Following this, we divided the dataset into a train set and a test set so that 80% of the data belongs to the train set. The variable crime is excluded from the test set (and saved separately) so that it can later be compared against the classes predicted by the model fitted on the train set.
## 'data.frame':    404 obs. of  14 variables:
##  $ zn     : num  0.7991 0.3703 -0.4872 -0.4872 0.0487 ...
##  $ indus  : num  -0.905 -1.138 0.247 1.231 -0.476 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -1.093 -0.965 -1.016 2.73 -0.265 ...
##  $ rm     : num  0.154 2.185 -0.584 -1.564 -0.388 ...
##  $ age    : num  -2.1591 -0.1448 -1.1359 0.8961 -0.0702 ...
##  $ dis    : num  1.539 0.427 0.336 -1.076 0.838 ...
##  $ rad    : num  -0.408 -0.522 -0.522 -0.522 -0.522 ...
##  $ tax    : num  -0.6422 -1.1406 -0.0607 -0.0311 -0.5769 ...
##  $ ptratio: num  -0.857 -1.642 0.113 -1.735 -1.504 ...
##  $ black  : num  0.19756 0.33557 0.43141 0.00346 0.42638 ...
##  $ lstat  : num  -1.0451 -1.2453 -0.4976 2.1939 -0.0312 ...
##  $ medv   : num  0.1269 2.4863 -0.2428 -0.5146 0.0399 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 2 1 2 3 2 1 3 2 3 2 ...
## 'data.frame':    102 obs. of  13 variables:
##  $ zn     : num  -0.4872 0.0487 -0.4872 -0.4872 -0.4872 ...
##  $ indus  : num  -1.306 -0.476 -0.437 -0.437 -0.437 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.834 -0.265 -0.144 -0.144 -0.144 ...
##  $ rm     : num  1.227 -0.392 -0.203 -0.513 -0.671 ...
##  $ age    : num  -0.511 0.509 0.822 0.907 0.772 ...
##  $ dis    : num  1.0767 1.1548 0.0864 0.2871 0.4212 ...
##  $ rad    : num  -0.752 -0.522 -0.637 -0.637 -0.637 ...
##  $ tax    : num  -1.105 -0.577 -0.601 -0.601 -0.601 ...
##  $ ptratio: num  0.113 -1.504 1.175 1.175 1.175 ...
##  $ black  : num  0.441 0.441 0.441 0.412 0.221 ...
##  $ lstat  : num  -1.0255 0.0864 0.8496 0.5107 0.302 ...
##  $ medv   : num  1.486 -0.395 -0.797 -0.754 -0.645 ...
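The split described above can be sketched as follows (mtcars as a stand-in, with am playing the role of the class label; the chapter itself splits the scaled Boston data on crime):

```r
set.seed(42)                              # assumption: any seed works here
n <- nrow(mtcars)
ind <- sample(n, size = round(0.8 * n))   # indices for 80% of the rows
train_toy <- mtcars[ind, ]
test_toy  <- mtcars[-ind, ]
correct_am <- test_toy$am                 # save the true labels aside...
test_toy$am <- NULL                       # ...and drop them from the test set
c(train = nrow(train_toy), test = nrow(test_toy))
```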

LDA

## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2450495 0.2574257 0.2425743 0.2549505 
## 
## Group means:
##                   zn      indus         chas        nox         rm        age
## low       0.88656059 -0.9500661 -0.113254311 -0.8941867  0.4790775 -0.8875667
## med_low  -0.07970125 -0.2886499  0.030524797 -0.5734982 -0.1438164 -0.3285724
## med_high -0.38135991  0.2509327  0.249939331  0.4313912  0.1588933  0.4015313
## high     -0.48724019  1.0170891 -0.004759149  1.1021338 -0.3722054  0.8355665
##                 dis        rad        tax     ptratio       black        lstat
## low       0.8352837 -0.6837338 -0.7216274 -0.47215981  0.37784670 -0.777511180
## med_low   0.3729391 -0.5489875 -0.5030656 -0.03764304  0.31637034 -0.114571117
## med_high -0.4275880 -0.4064658 -0.2800958 -0.28629788  0.07591514 -0.001672131
## high     -0.8535667  1.6384176  1.5142626  0.78111358 -0.86788506  0.917481402
##                 medv
## low       0.53711700
## med_low  -0.02228111
## med_high  0.21245039
## high     -0.77892717
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.11843430  0.52726542 -0.90061486
## indus    0.08170681 -0.45334569  0.47348615
## chas    -0.07785526 -0.01669165  0.09913657
## nox      0.39441487 -0.84641909 -1.21143964
## rm      -0.08824416 -0.08412197 -0.23679190
## age      0.24209266 -0.24851742  0.01581911
## dis     -0.08839995 -0.15530994  0.33526861
## rad      3.16012822  0.91407398  0.17770450
## tax     -0.02466061  0.17441209  0.04616831
## ptratio  0.11329583 -0.06834929 -0.17943083
## black   -0.12064856  0.01632784  0.09834896
## lstat    0.21940269 -0.15678505  0.36454540
## medv     0.18921628 -0.47055295 -0.16070586
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9472 0.0402 0.0126

##   [1] med_low  med_low  med_low  med_low  med_low  med_low  med_low  med_low 
##   [9] med_low  med_low  med_low  med_low  low      low      med_low  med_low 
##  [17] med_low  med_low  low      med_low  low      med_high med_low  med_low 
##  [25] med_low  med_high med_high med_high med_high med_high med_high med_high
##  [33] med_high med_high med_high med_low  low      med_low  med_high low     
##  [41] low      low      low      low      med_low  med_low  med_low  med_high
##  [49] med_low  low      med_low  med_high med_high low      low      low     
##  [57] low      med_low  low      med_low  med_low  med_low  med_low  med_low 
##  [65] med_low  med_low  med_low  low      low      low      low      low     
##  [73] low      high     high     high     high     high     high     high    
##  [81] high     high     high     high     high     high     high     high    
##  [89] high     high     high     high     high     high     high     high    
##  [97] high     high     med_high med_high med_high med_low 
## Levels: low med_low med_high high
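The predicted classes above come from predict() on the held-out test set; a self-contained sketch of the same fit-predict-tabulate pattern (iris instead of the Boston train/test split):

```r
library(MASS)
fit  <- lda(Species ~ ., data = iris)
pred <- predict(fit, newdata = iris)
# cross-tabulate true classes against predicted classes
table(correct = iris$Species, predicted = pred$class)
```

The diagonal of the resulting table counts the correctly classified observations.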

Distance measurement

library(MASS)
data('Boston')
boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
class(boston_scaled)
## [1] "matrix"
boston_scaled<-as.data.frame(boston_scaled)
dist_eu <- dist(boston_scaled)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
dist_man <- dist(boston_scaled, method = "manhattan") # manhattan distance matrix
summary(dist_man)
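A toy example makes the difference between the two metrics concrete:

```r
pts <- rbind(c(0, 0), c(3, 4))
dist(pts)                          # euclidean: sqrt(3^2 + 4^2) = 5
dist(pts, method = "manhattan")    # manhattan: |3| + |4| = 7
```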

K-means clustering

# k-means clustering
km <-kmeans(Boston, centers = 4)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)

pairs(Boston[6:10], col = km$cluster)

set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')

km <-kmeans(Boston, centers = 2)
pairs(Boston, col = km$cluster)

  • An abrupt drop in the total within-cluster sum of squares (WCSS) happens when the number of clusters (x-axis) reaches about 2, suggesting that the optimal number of clusters is most likely 2. Beyond that, the WCSS decreases only slowly as the number of clusters grows up to 10, so adding more clusters gains little.

Bonus

library(MASS)
data(Boston) # load the Boston data set
boston_scaled1<- scale(Boston)# scale the Boston data set again - named boston_scaled1
boston_scaled2<-as.data.frame(boston_scaled1)
km1 <-kmeans(Boston, centers = 3)# k-means clustering
cluster <- km1$cluster
boston_scaled2<- data.frame(boston_scaled1,cluster) # add the cluster number to the df
lda.fit<- lda(cluster~., data = boston_scaled2)#linear discriminant analysis of clusters vs. all other variables
lda.fit# print the lda.fit object
## Call:
## lda(cluster ~ ., data = boston_scaled2)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.5296443 0.1996047 0.2707510 
## 
## Group means:
##         crim          zn      indus          chas        nox         rm
## 1 -0.3920779  0.27670879 -0.6513071  0.0214843827 -0.6152775  0.2573427
## 2 -0.3293317 -0.07332724  0.2818828  0.0005392655  0.2816899 -0.1453417
## 3  1.0097765 -0.48724019  1.0662784 -0.0424254043  0.9959393 -0.3962652
##          age        dis        rad         tax    ptratio       black
## 1 -0.4572006  0.5121870 -0.6013344 -0.78136288 -0.2690134  0.34109296
## 2  0.1822823 -0.2378455 -0.5418150 -0.01444889 -0.3768823  0.07010933
## 3  0.7599946 -0.8265965  1.5757732  1.53915759  0.8040926 -0.71893398
##         lstat        medv
## 1 -0.43621538  0.36234147
## 2  0.01371321 -0.03812375
## 3  0.84321670 -0.68070813
## 
## Coefficients of linear discriminants:
##                  LD1         LD2
## crim     0.048210477  0.05079118
## zn       0.253528315  0.06311589
## indus    0.369497254  0.12674727
## chas    -0.047064817  0.01998369
## nox     -0.063156250 -0.49621758
## rm      -0.005144383  0.09537352
## age     -0.118710969  0.05412142
## dis     -0.385151599  0.17969944
## rad      1.996321584  3.05733525
## tax      4.535785039 -2.77688761
## ptratio  0.122064688  0.19196217
## black   -0.029200518  0.06353722
## lstat    0.085030308  0.12666624
## medv     0.157444662 -0.10356584
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9812 0.0188
# the function for the lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1, 2)) {
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}

classes1 <- as.numeric(boston_scaled2$cluster) # target classes as numeric
plot(lda.fit, dimen = 2, col = classes1, pch = classes1, main = "LDA biplot using three clusters 1, 2 and 3") # plot the lda results
lda.arrows(lda.fit, myscale = 2)
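The objects boston_scaled1 and km1 used above come from earlier chunks that are not shown in this excerpt. A minimal sketch of how they were presumably created (the seed value is an assumption):

```r
library(MASS)                                  # for the Boston data and lda()
data("Boston")
boston_scaled1 <- as.data.frame(scale(Boston)) # standardize all variables
set.seed(123)                                  # k-means uses random starting centroids
km1 <- kmeans(boston_scaled1, centers = 3)     # three clusters, matching the LDA above
```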

chapter5: Dimensionality reduction techniques

Reading the data

knitr::opts_chunk$set(echo = FALSE)
setwd("~/IODS-project")
human <- read.table("~/IODS-project/data/human.txt")
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ secedu.R    : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ labour.ratio: num  0.891 0.819 0.825 0.884 0.829 ...
##  $ expect.edu  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ life.expB   : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI.capita  : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ MMR         : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ adols.BR    : num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ par.percent : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155   8
#Read the required packages
library(MASS)
library(dplyr)
library(ggplot2)
library(GGally)

Summary and Distribution of the data

##     secedu.R       labour.ratio      expect.edu      life.expB    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##    GNI.capita          MMR            adols.BR       par.percent   
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
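The summary above was produced with echo = FALSE, so the code is hidden; it was presumably generated along these lines (the ggpairs options are an assumption):

```r
summary(human)  # numerical summaries of all eight variables

# distributions on the diagonal, scatterplots below, correlations above
ggpairs(human, lower = list(continuous = "smooth"))
```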

  1. Distribution and relationship between the variables:
  • Among all variables, expected years of schooling (expect.edu) shows the most homogeneous, close-to-normal distribution; the secondary education ratio (secedu.R) and par.percent are also somewhat close to a normal distribution.
  • Life expectancy at birth (life.expB) and expect.edu show the strongest positive relationship of all pairs, suggesting that people expected to live longer also receive more education. This is followed by the correlations between the maternal mortality rate (MMR) and the adolescent birth rate (adols.BR), between gross national income per capita (GNI.capita) and expect.edu, and between life.expB and secedu.R.
  • In contrast, MMR and life.expB show the strongest negative correlation of all, indicating that the higher the maternal mortality rate, the lower the life expectancy at birth.
  • There seems to be no relationship between secedu.R and labour.ratio at all.
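The correlations described above can be checked numerically; a sketch:

```r
# rounded correlation matrix: look up e.g. life.expB vs. expect.edu (strong positive)
# and MMR vs. life.expB (strong negative)
round(cor(human), digits = 2)
```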

PCA on non-standardized data

PCA on standardized data: Way I

##     secedu.R        labour.ratio       expect.edu        life.expB      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##    GNI.capita           MMR             adols.BR        par.percent     
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3

  1. Is there any difference between the PCAs of the non-standardized and the standardized data?
  • Yes, there is a clear difference between the two PCAs. In the biplot of the PCA on non-standardized data we see only one arrow, for GNI per capita, while in the biplot of the PCA on standardized data we see many arrows, with several HDI indices pointing towards the two PCs. PCA is based on variances (squared deviations from the mean), so to keep it from emphasizing the variables with the largest variances we must standardize the data first. For example, GNI.capita is measured in thousands of dollars, whereas variables such as MMR and secedu.R take relatively small values; if we do not standardize such data, PCA gives almost all of the weight to the feature with the largest variance, here GNI.capita, as seen in the non-standardized biplot. Standardizing the data set therefore allows a fair comparison of the variance explained by each variable, and as a result many variables become visibly linked to the two main PCs.
  2. Personal interpretation of the first two PCs of the PCA on standardized data:
  • In the biplot of the standardized data, PC1 explains almost 54 % of the variance in the data set and is mostly linked with MMR and the adolescent birth rate (adols.BR).
  • PC2, on the other hand, explains about 16 % of the variance in the data set and is associated with expect.edu, GNI.capita, secedu.R, and life.expB.
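The two PCAs compared above can be sketched as follows (object names are assumptions):

```r
# PCA on the raw data: GNI.capita dominates because its variance is by far the largest
pca_raw <- prcomp(human)
biplot(pca_raw, choices = 1:2, cex = c(0.6, 1))

# PCA on the standardized data: all variables contribute on a comparable scale
pca_std <- prcomp(scale(human))
summary(pca_std)                    # proportion of variance per PC
biplot(pca_std, choices = 1:2, cex = c(0.6, 1))
```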

Biplot on standardized data set - Way II

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
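The importance table above comes from calling summary() on a prcomp object; the rounded percentages shown in the Way I output can be extracted from it like this (object names are assumptions):

```r
pca_std <- prcomp(scale(human))     # PCA on the standardized data
s <- summary(pca_std)
round(100 * s$importance["Proportion of Variance", ], digits = 1)
```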

Multiple correspondence analysis (MCA)

## [1] 300  36
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...

Visualization of the tea data

MCA of the tea data (full dataset)

## 
## Call:
## MCA(X = tea, quanti.sup = 19, quali.sup = 20:36) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.148   0.122   0.090   0.078   0.074   0.071   0.068
## % of var.              9.885   8.103   6.001   5.204   4.917   4.759   4.522
## Cumulative % of var.   9.885  17.988  23.989  29.192  34.109  38.868  43.390
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.065   0.062   0.059   0.057   0.054   0.052   0.049
## % of var.              4.355   4.123   3.902   3.805   3.628   3.462   3.250
## Cumulative % of var.  47.745  51.867  55.769  59.574  63.202  66.664  69.914
##                       Dim.15  Dim.16  Dim.17  Dim.18  Dim.19  Dim.20  Dim.21
## Variance               0.048   0.047   0.046   0.040   0.038   0.037   0.036
## % of var.              3.221   3.127   3.037   2.683   2.541   2.438   2.378
## Cumulative % of var.  73.135  76.262  79.298  81.982  84.523  86.961  89.339
##                       Dim.22  Dim.23  Dim.24  Dim.25  Dim.26  Dim.27
## Variance               0.035   0.031   0.029   0.027   0.021   0.017
## % of var.              2.323   2.055   1.915   1.821   1.407   1.139
## Cumulative % of var.  91.662  93.717  95.633  97.454  98.861 100.000
## 
## Individuals (the 10 first)
##                  Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1             | -0.541  0.658  0.143 | -0.149  0.061  0.011 | -0.306  0.347
## 2             | -0.361  0.293  0.133 | -0.078  0.017  0.006 | -0.633  1.483
## 3             |  0.073  0.012  0.003 | -0.169  0.079  0.018 |  0.246  0.224
## 4             | -0.572  0.735  0.235 |  0.018  0.001  0.000 |  0.203  0.153
## 5             | -0.253  0.144  0.079 | -0.118  0.038  0.017 |  0.006  0.000
## 6             | -0.684  1.053  0.231 |  0.032  0.003  0.001 | -0.018  0.001
## 7             | -0.111  0.027  0.022 | -0.182  0.090  0.059 | -0.207  0.159
## 8             | -0.210  0.099  0.043 | -0.068  0.013  0.004 | -0.421  0.655
## 9             |  0.118  0.031  0.012 |  0.229  0.144  0.044 | -0.538  1.070
## 10            |  0.258  0.150  0.045 |  0.478  0.627  0.156 | -0.482  0.861
##                 cos2  
## 1              0.046 |
## 2              0.409 |
## 3              0.038 |
## 4              0.030 |
## 5              0.000 |
## 6              0.000 |
## 7              0.077 |
## 8              0.174 |
## 9              0.244 |
## 10             0.158 |
## 
## Categories (the 10 first)
##                  Dim.1    ctr   cos2 v.test    Dim.2    ctr   cos2 v.test  
## breakfast     |  0.166  0.495  0.025  2.756 | -0.166  0.607  0.026 -2.764 |
## Not.breakfast | -0.153  0.457  0.025 -2.756 |  0.154  0.560  0.026  2.764 |
## Not.tea time  | -0.498  4.053  0.192 -7.578 |  0.093  0.174  0.007  1.423 |
## tea time      |  0.386  3.142  0.192  7.578 | -0.072  0.135  0.007 -1.423 |
## evening       |  0.319  1.307  0.053  3.985 | -0.058  0.053  0.002 -0.728 |
## Not.evening   | -0.167  0.683  0.053 -3.985 |  0.030  0.028  0.002  0.728 |
## lunch         |  0.659  2.385  0.075  4.722 | -0.390  1.018  0.026 -2.793 |
## Not.lunch     | -0.113  0.410  0.075 -4.722 |  0.067  0.175  0.026  2.793 |
## dinner        | -0.661  1.146  0.033 -3.136 |  0.796  2.025  0.048  3.774 |
## Not.dinner    |  0.050  0.086  0.033  3.136 | -0.060  0.152  0.048 -3.774 |
##                Dim.3    ctr   cos2 v.test  
## breakfast     -0.483  6.900  0.215 -8.017 |
## Not.breakfast  0.445  6.369  0.215  8.017 |
## Not.tea time   0.265  1.886  0.054  4.027 |
## tea time      -0.205  1.462  0.054 -4.027 |
## evening        0.451  4.312  0.106  5.640 |
## Not.evening   -0.236  2.254  0.106 -5.640 |
## lunch          0.301  0.822  0.016  2.160 |
## Not.lunch     -0.052  0.141  0.016 -2.160 |
## dinner         0.535  1.235  0.022  2.537 |
## Not.dinner    -0.040  0.093  0.022 -2.537 |
## 
## Categorical variables (eta2)
##                 Dim.1 Dim.2 Dim.3  
## breakfast     | 0.025 0.026 0.215 |
## tea.time      | 0.192 0.007 0.054 |
## evening       | 0.053 0.002 0.106 |
## lunch         | 0.075 0.026 0.016 |
## dinner        | 0.033 0.048 0.022 |
## always        | 0.045 0.001 0.101 |
## home          | 0.005 0.000 0.134 |
## work          | 0.112 0.043 0.005 |
## tearoom       | 0.372 0.022 0.008 |
## friends       | 0.243 0.015 0.103 |
## 
## Supplementary categories (the 10 first)
##                  Dim.1   cos2 v.test    Dim.2   cos2 v.test    Dim.3   cos2
## F             |  0.151  0.033  3.158 | -0.109  0.017 -2.278 | -0.048  0.003
## M             | -0.221  0.033 -3.158 |  0.159  0.017  2.278 |  0.070  0.003
## employee      | -0.153  0.006 -1.313 | -0.151  0.006 -1.289 |  0.103  0.003
## middle        | -0.030  0.000 -0.205 |  0.336  0.017  2.281 | -0.284  0.012
## non-worker    | -0.036  0.000 -0.324 |  0.185  0.009  1.666 | -0.291  0.023
## other worker  |  0.040  0.000  0.187 |  0.013  0.000  0.061 | -0.063  0.000
## senior        |  0.415  0.023  2.608 |  0.072  0.001  0.452 | -0.187  0.005
## student       |  0.032  0.000  0.305 | -0.317  0.031 -3.022 |  0.394  0.047
## workman       | -0.417  0.007 -1.473 |  0.249  0.003  0.878 |  0.343  0.005
## Not.sportsman | -0.030  0.001 -0.426 |  0.018  0.000  0.260 | -0.051  0.002
##               v.test  
## F             -0.998 |
## M              0.998 |
## employee       0.884 |
## middle        -1.928 |
## non-worker    -2.620 |
## other worker  -0.289 |
## senior        -1.177 |
## student        3.760 |
## workman        1.209 |
## Not.sportsman -0.721 |
## 
## Supplementary categorical variables (eta2)
##                    Dim.1 Dim.2 Dim.3  
## sex              | 0.033 0.017 0.003 |
## SPC              | 0.032 0.053 0.076 |
## Sport            | 0.001 0.000 0.002 |
## age_Q            | 0.008 0.077 0.146 |
## frequency        | 0.094 0.006 0.064 |
## escape.exoticism | 0.000 0.007 0.000 |
## spirituality     | 0.005 0.000 0.016 |
## healthy          | 0.000 0.000 0.008 |
## diuretic         | 0.004 0.000 0.013 |
## friendliness     | 0.071 0.001 0.013 |
## 
## Supplementary continuous variable
##                  Dim.1    Dim.2    Dim.3  
## age           |  0.042 |  0.204 | -0.340 |
  1. MCA result interpretation:
  • First of all, it is important to know that MCA is mainly used for categorical variables, as we have in the tea dataset.
  • However, a continuous variable can be used as a background (supplementary) variable.
  • The summary of the MCA shows, in descending order, the percentage of variance explained (as eigenvalues) by each dimension. For example, Dim 1 explains close to 10 % of the variance while Dim 27 explains only about 1 %.
  • Following that, the summary shows the contribution (ctr) of each individual to the dimensions, and we can say that Dim 1 and Dim 2 are mainly influenced by individuals 6 and 10, respectively.
  • On the other hand, cos2 represents how well an individual is represented on the dimensions. If cos2 is close to 1, the individual is well projected onto the dimensions; in our case individuals 4 and 10 represent Dim 1 and Dim 2 well, respectively.
  • MCA shows the significance of the active categories with respect to zero as a v-test. If the v-test is between -2 and 2, the coordinate of the category is not significantly different from zero; if the v-test is > 2, the category is significantly greater than zero, and if it is < -2, the category is significantly less than zero. From this, we can say that on Dim 1 the categories breakfast, tea time, evening, lunch, and Not.dinner are significantly greater than zero, whereas the same holds for Not.breakfast, Not.lunch, and dinner on Dim 2.
  • Following that, MCA also gives the influence of each categorical variable on the dimensions as eta2, which is similar to an ANOVA test. A value close to 1 suggests a strong link between a dimension and a categorical variable. So we can say that tearoom, tea.time, and friends are strongly linked with Dim 1, while there is no such strong relationship between the variables and Dim 2.
  • There are also results for the supplementary categories, similar to those for the active categories; however, the supplementary variables do not contribute to the construction of the dimensions.
  • Lastly, MCA gives the relationship between the continuous (supplementary) variable of the dataset and the dimensions.
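The output above matches the call shown in its header; a sketch of reproducing it with FactoMineR (graph = FALSE only suppresses the default plots):

```r
library(FactoMineR)
# column 19 (age) as a supplementary continuous variable,
# columns 20-36 as supplementary categorical variables
mca <- MCA(tea, quanti.sup = 19, quali.sup = 20:36, graph = FALSE)
summary(mca, nbelements = 10)  # eigenvalues, plus ctr and cos2 for the first 10 elements
```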

Exploring various MCA plotting options

  1. Comments on the MCA plots:
  • In the first plot under this title, I reduced the font sizes with the cex argument and plotted only the 20 individuals contributing most to the dimensions. I also changed the color of the less important variables to dark grey.

  • In the second plot, I plotted the quality of the individuals on the dimensions (cos2 greater than 0.1), restricted to the 10 most contributing variables, shown in dark blue.

  • In the third biplot, a different color is assigned to the individuals using the habillage argument; as the grouping variable I used the frequency of drinking tea.

  • Similar to the second biplot, the fourth biplot represents the individuals by their cos2 values.
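Sketches of plot.MCA calls that would produce graphs like the ones described above (the object name mca and the exact argument values are assumptions based on my comments):

```r
# 1) smaller fonts, only the 20 most contributing individuals, rest in dark grey
plot(mca, cex = 0.7, select = "contrib 20", unselect = "grey30")

# 2) individuals with cos2 > 0.1, restricted to the 10 most contributing categories
plot(mca, select = "cos2 0.1", selectMod = "contrib 10")

# 3) color the individuals by the frequency of drinking tea
plot(mca, habillage = "frequency")
```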